Topic Modelling Experiments on Hellenistic Corpora

Authors

  • Ryder Wishart
  • Prokopis Prokopidis
Abstract

This study focuses on Hellenistic Greek, a variety of Greek that continues to be of particular interest within the humanities. We argue that Hellenistic Greek requires tools specifically tuned to its orthographic and semantic idiosyncrasies. The paper puts available documents to use in two ways: (1) by describing the development of a POS tagger and a lemmatizer trained on annotated texts written in Hellenistic Greek, and (2) by representing the lemmatized output as topic models in order to examine the effects of (a) automatically processing the texts and (b) semi-automatically correcting the lemmatizer's output for tokens that occur frequently in Hellenistic Greek corpora. In addition to topic models, we also generate and compare lists of semantically related words.
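The semi-automatic correction step mentioned in the abstract can be pictured roughly as follows. This is only a minimal sketch, not the authors' implementation: the (token, lemma) pairs stand in for hypothetical lemmatizer output, the `corrections` table is an invented example of a manually reviewed entry, and the function names are assumptions made for illustration.

```python
from collections import Counter

def frequent_tokens(docs, top_n):
    """Return the top_n most frequent surface tokens across all documents."""
    counts = Counter(tok for doc in docs for tok, _ in doc)
    return [tok for tok, _ in counts.most_common(top_n)]

def apply_corrections(docs, corrections):
    """Swap in a reviewed lemma wherever the token has a manual correction."""
    return [
        [(tok, corrections.get(tok, lemma)) for tok, lemma in doc]
        for doc in docs
    ]

def bag_of_lemmas(doc):
    """Count lemmas in one document (a typical input form for a topic model)."""
    return Counter(lemma for _, lemma in doc)

# Hypothetical lemmatizer output: the lemma of the verb form in the third
# position is deliberately left wrong (identical to the surface token).
docs = [
    [("ἐν", "ἐν"), ("ἀρχῇ", "ἀρχή"), ("ἦν", "ἦν"), ("ὁ", "ὁ"), ("λόγος", "λόγος")],
    [("καὶ", "καί"), ("θεὸς", "θεός"), ("ἦν", "ἦν"), ("ὁ", "ὁ"), ("λόγος", "λόγος")],
]

# Step 1: list the most frequent tokens so a human can review their lemmas.
review_list = frequent_tokens(docs, top_n=3)

# Step 2: a reviewer supplies corrections for the frequent tokens that need one
# (assumed example entry).
corrections = {"ἦν": "εἰμί"}

# Step 3: apply the corrections and build per-document lemma counts.
corrected = apply_corrections(docs, corrections)
bags = [bag_of_lemmas(doc) for doc in corrected]
```

Because only high-frequency tokens are reviewed, a small correction table can change a large share of the lemma counts that a topic model would later be trained on, which is presumably why the two conditions are worth comparing.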

Similar resources

Building and Modelling Multilingual Subjective Corpora

Building multilingual opinionated models requires multilingual corpora annotated with opinion labels. Unfortunately, such corpora are rare. In this work we treat opinions as either subjective or objective. In this paper, we introduce an annotation method that can be reliably transferred across topic domains and across languages. The method starts by building a classifier that annotates sent...

Topic Stability over Noisy Sources

Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous texts which may undermine topic stability. Therefore, it is important to know how well a topic modelling algorithm will perform when applied to noisy data. In this paper we show that different types of textual noise will have diverse effects on the...

Evaluating a Topic Modelling Approach to Measuring Corpus Similarity

Web corpora are often constructed automatically, and their contents are therefore often not well understood. One technique for assessing the composition of such a web corpus is to empirically measure its similarity to a reference corpus whose composition is known. In this paper we evaluate a number of measures of corpus similarity, including a method based on topic modelling which has not been ...

Adaptive topic-dependent language modelling using word-based varigrams

This paper presents two extensions of the standard interpolated word trigram and cache model, namely the extension of the trigram model by useful word m-grams with m > 3, resulting in a varigram model, and the addition of topic-specific trigram models. We give the criteria for selecting useful m-grams and for partitioning the training corpus into topic-specific subcorpora. We apply both extens...

Lau, Jey Han, David Newman and Timothy Baldwin (to appear) On Collocations and Topic Models, ACM Transactions on Speech and Language Processing

We investigate the impact of pre-extracting and tokenising bigram collocations on topic models. Using extensive experiments on four different corpora, we show that incorporating bigram collocations in the document representation creates more parsimonious models and improves topic coherence. We point out some problems in interpreting test likelihood and test perplexity to compare model fit, and ...

Publication date: 2017